True Asymptotic Natural Gradient Optimization
Author
Abstract
We introduce a simple algorithm, True Asymptotic Natural Gradient Optimization (TANGO), that converges to a true natural gradient descent in the limit of small learning rates, without explicit Fisher matrix estimation. For quadratic models the algorithm is also an instance of averaged stochastic gradient, where the parameter is a moving average of a "fast", constant-rate gradient descent. TANGO appears as a particular de-linearization of averaged SGD, and is sometimes quite different on non-quadratic models. This further connects averaged SGD and natural gradient, both of which are arguably optimal asymptotically. In large dimension, small learning rates will be required to approximate the natural gradient well. Still, this shows it is possible to get arbitrarily close to exact natural gradient descent with a lightweight algorithm.

Let $p_\theta(y|x)$ be a probabilistic model for predicting output values $y$ from inputs $x$ ($x = \emptyset$ for unsupervised learning). Consider the associated log-loss
$$\ell(y|x) := -\ln p_\theta(y|x) \tag{1}$$
Given a dataset $\mathcal{D}$ of pairs $(x, y)$, we optimize the average log-loss over $\theta$ via a momentum-like gradient descent.

Definition 1 (TANGO). Let $\delta t_k \leq 1$ be a sequence of learning rates and let $\gamma > 0$. Set $v_0 = 0$. Iterate the following:

• Select a sample $(x_k, y_k)$ at random in the dataset $\mathcal{D}$.

• Generate a pseudo-sample $\tilde{y}_k$ for input $x_k$ according to the predictions of the current model, $\tilde{y}_k \sim p_\theta(\tilde{y}_k \mid x_k)$ (or just $\tilde{y}_k = y_k$ for the "outer product" variant). Compute gradients
$$g_k \leftarrow \frac{\partial \ell(y_k|x_k)}{\partial \theta}, \qquad \tilde{g}_k \leftarrow \frac{\partial \ell(\tilde{y}_k|x_k)}{\partial \theta} \tag{2}$$

• Update the velocity and parameter via
$$v_k = (1 - \delta t_{k-1})\, v_{k-1} + \gamma\, g_k - \gamma\, (1 - \delta t_{k-1})\, \big(v_{k-1}^\top \tilde{g}_k\big)\, \tilde{g}_k \tag{3}$$
$$\theta_k = \theta_{k-1} - \delta t_k\, v_k \tag{4}$$
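As a rough illustration of Definition 1, here is a minimal NumPy sketch of the TANGO iteration applied to a toy binary logistic-regression model. The model choice, the synthetic dataset, and the hyperparameter values (gamma, and a constant step delta t) are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch of the TANGO update (Definition 1) on a toy logistic model.
# Dataset, model and hyperparameters (gamma, dt) are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: binary labels from a logistic model p_theta(y=1|x) = sigmoid(x @ theta)
n, d = 500, 10
X = rng.normal(size=(n, d))
true_theta = rng.normal(size=d)
Y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ true_theta))).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_log_loss(theta, x, y):
    # gradient of l(y|x) = -ln p_theta(y|x) for the logistic model
    return (sigmoid(x @ theta) - y) * x

gamma = 0.1          # "fast" rate gamma (assumed value)
dt = 1e-2            # learning rate delta t_k, kept constant here (assumed)
theta = np.zeros(d)
v = np.zeros(d)      # v_0 = 0

for k in range(1, 20001):
    i = rng.integers(n)
    x_k, y_k = X[i], Y[i]

    # pseudo-sample y~_k from the current model's predictive distribution
    y_tilde = float(rng.random() < sigmoid(x_k @ theta))
    # (the "outer product" variant would instead use y_tilde = y_k)

    g = grad_log_loss(theta, x_k, y_k)        # g_k, eq. (2)
    g_t = grad_log_loss(theta, x_k, y_tilde)  # g~_k, eq. (2)

    # velocity update, eq. (3): every occurrence of v on the right is v_{k-1}
    v = (1.0 - dt) * v + gamma * g - gamma * (1.0 - dt) * (v @ g_t) * g_t
    # parameter update, eq. (4)
    theta = theta - dt * v

p = sigmoid(X @ theta)
print("final average log-loss:",
      np.mean(-Y * np.log(p + 1e-12) - (1 - Y) * np.log(1 - p + 1e-12)))
```

In expectation the velocity settles near $(\delta t\, I + \gamma F)^{-1} \gamma\, \bar{g}$, so for small $\delta t$ the parameter step $\delta t\, v_k$ approximates a natural gradient step, consistent with the abstract's claim.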
Similar resources
Non-Asymptotic Convergence Analysis of Inexact Gradient Methods for Machine Learning Without Strong Convexity
Many recent applications in machine learning and data fitting call for the algorithmic solution of structured smooth convex optimization problems. Although the gradient descent method is a natural choice for this task, it requires exact gradient computations and hence can be inefficient when the problem size is large or the gradient is difficult to evaluate. Therefore, there has been much inter...
Momentum and Optimal Stochastic Search
The rate of convergence for gradient descent algorithms, both batch and stochastic, can be improved by including in the weight update a “momentum” term proportional to the previous weight update. Several authors [1, 2] give conditions for convergence of the mean and covariance of the weight vector for momentum LMS with constant learning rate. However stochastic algorithms require that the learn...
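For reference, the entry above concerns the classical momentum (heavy-ball) form of stochastic gradient descent. Below is a minimal sketch of that standard update on a toy LMS-style least-squares problem; the data, the constant learning rate lr, and the momentum coefficient beta are assumed values, and this is not the specific analysis of that paper.

```python
# Minimal sketch of stochastic gradient descent with a momentum term (heavy ball)
# on a toy linear least-squares (LMS-style) problem; all values are assumptions.
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 5))
true_w = rng.normal(size=5)
b = A @ true_w + 0.1 * rng.normal(size=50)   # noisy linear observations

theta = np.zeros(5)
update = np.zeros(5)   # previous weight update
lr, beta = 0.01, 0.9   # constant learning rate and momentum coefficient (assumed)

for k in range(5000):
    i = rng.integers(50)
    grad = (A[i] @ theta - b[i]) * A[i]   # stochastic gradient of 0.5*(A[i]@theta - b[i])**2
    update = beta * update - lr * grad    # momentum: previous update re-enters the new one
    theta = theta + update

print("distance to true weights:", np.linalg.norm(theta - true_w))
```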
Sequential Convex Approximations to Joint Chance Constrained Programs: A Monte Carlo Approach
When there is parameter uncertainty in the constraints of a convex optimization problem, it is natural to formulate the problem as a joint chance constrained program (JCCP) which requires that all constraints be satisfied simultaneously with a given large probability. In this paper, we propose to solve the JCCP by a sequence of convex approximations. We show that the solutions of the sequence o...
Approximate Joint Diagonalization Using a Natural Gradient Approach
We present a new algorithm for non-unitary approximate joint diagonalization (AJD), based on a “natural gradient”-type multiplicative update of the diagonalizing matrix, complemented by step-size optimization at each iteration. The advantages of the new algorithm over existing non-unitary AJD algorithms are in the ability to accommodate non-positive-definite matrices (compared to Pham’s algorit...
Bayesian Learning via Stochastic Gradient Langevin Dynamics
In this paper we propose a new framework for learning from large scale datasets based on iterative learning from small mini-batches. By adding the right amount of noise to a standard stochastic gradient optimization algorithm we show that the iterates will converge to samples from the true posterior distribution as we anneal the stepsize. This seamless transition between optimization and Bayesi...
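As a concrete reference point for the entry above, here is a minimal sketch of the standard stochastic gradient Langevin dynamics (SGLD) update on a toy model (posterior over the mean of a unit-variance Gaussian with a standard-normal prior). The data, minibatch size, and stepsize schedule are illustrative assumptions, not the experiments of that paper.

```python
# Minimal sketch of stochastic gradient Langevin dynamics (SGLD) on a toy
# Gaussian-mean model; data, minibatch size and stepsize schedule are assumptions.
import numpy as np

rng = np.random.default_rng(2)
N = 1000
data = rng.normal(loc=2.0, scale=1.0, size=N)   # observations x_i ~ N(theta, 1)
batch = 32

theta = 0.0
samples = []
for t in range(1, 20001):
    eps = 1e-3 / t**0.55                        # annealed stepsize (assumed schedule)
    idx = rng.integers(N, size=batch)
    # grad log prior N(0,1): -theta ; grad log likelihood per point: (x_i - theta)
    grad = -theta + (N / batch) * np.sum(data[idx] - theta)
    noise = rng.normal(scale=np.sqrt(eps))
    theta = theta + 0.5 * eps * grad + noise    # SGLD: gradient step plus injected noise
    samples.append(theta)

print("posterior mean estimate:", np.mean(samples[5000:]))
```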
Journal: CoRR
Volume: abs/1712.08449
Publication date: 2017